Cancer Prediction Using Diversity-Based Ensemble Genetic Programming
Authors
Jin-Hyuk Hong and Sung-Bae Cho
Abstract
Combining a set of classifiers is a common way to improve classification performance. Base classifiers that are both accurate and diverse are a prerequisite for constructing a good ensemble classifier, so estimating the diversity among classifiers has been widely investigated. This paper presents an ensemble approach that combines a set of diverse rules obtained by genetic programming. Genetic programming generates interpretable classification rules, and the diversity among them is estimated directly. Several diverse rules are then combined by a fusion method to produce a final decision. The proposed method has been applied to cancer classification using gene expression profiles, an important problem in bioinformatics. Experiments on several popular cancer datasets demonstrate the usability of the method: it achieves high performance, and accuracy increases with the diversity among the base classification rules.
1 Introduction
Genetic programming is a representative technique in evolutionary computation with several distinguishing characteristics [1]. In particular, the interpretable rules it produces provide not only useful information about the classification but also many opportunities for combination with other approaches. Diversity, which is important in an ensemble, can be estimated directly by comparing the rules [2].
Combining classifiers, known as ensembling, has received attention as a way to improve classification performance [3,4]. An ensemble classifier is obtained by combining the outputs of multiple classifiers, and the diversity among the base classifiers matters in addition to their accuracy. Diversity describes how differently the classifiers are formed, while accuracy describes how correctly a classifier categorizes. Many researchers have studied ensemble techniques and diversity measures.
Hansen and Salamon provided the theoretical basis for ensembles [5], while Opitz and Maclin carried out comprehensive empirical ensemble experiments [6]. Zhou et al. analyzed, both theoretically and empirically, the effect of the number of classifiers participating in an ensemble [7]. Bagging and boosting have been actively investigated as popular ensemble learning techniques for generating base classifiers, and various fusion strategies have also been studied for effective ensembles [3,4,8]. Brown conducted a survey on generating diverse classifiers for ensembles [9]. Bakker and Heskes studied a hybrid model for efficient ensembles [10], while Tan and Gilbert applied ensembles to classifying gene expression data [11].
Since an ensemble of identical classifiers does not improve performance [8], selecting base classifiers that are diverse as well as accurate is very important in making a good ensemble classifier [9]. Simple ways to generate varied classifiers are to initialize parameters randomly or to vary the training data. Bagging (bootstrap aggregating), introduced by Breiman, generates individual classifiers by training each on a randomly drawn set of samples from the original data [12]. Ensemble classifiers built with bagging aggregate the base classifiers through a voting mechanism. Boosting, another popular ensemble learning method, was introduced by Schapire to produce a series of classifiers [13]. The training samples for each classifier are chosen based on the performance of the previous classifiers in the series: examples misclassified by earlier classifiers are more likely to be selected as training samples for the current one. Arcing [14] and AdaBoost [13] are representative boosting methods. Some researchers select diverse classifiers to combine into an ensemble [3,8].
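The bagging procedure described above (bootstrap resampling followed by voting) can be illustrated with a minimal Python sketch. The base learner `train_stump`, the one-dimensional toy data, and the class labels 1 and 2 are assumptions made only for this illustration; they are not the classifiers used in the paper.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) examples from data with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Hypothetical base learner: threshold halfway between the class means."""
    xs1 = [x for x, y in sample if y == 1]
    xs2 = [x for x, y in sample if y == 2]
    if not xs1 or not xs2:
        # The bootstrap sample contains a single class: predict it always.
        only = 1 if xs1 else 2
        return lambda x: only
    m1, m2 = sum(xs1) / len(xs1), sum(xs2) / len(xs2)
    threshold = (m1 + m2) / 2
    high, low = (1, 2) if m1 > m2 else (2, 1)
    return lambda x: high if x > threshold else low

def bagging(data, n_classifiers, seed=0):
    """Train one base classifier per bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(n_classifiers)]

def vote(classifiers, x):
    """Aggregate the base classifiers by majority vote."""
    return Counter(clf(x) for clf in classifiers).most_common(1)[0][0]

# Toy data: (feature value, class label)
data = [(0.1, 2), (0.2, 2), (0.3, 2), (0.9, 1), (1.0, 1), (1.1, 1)]
ensemble = bagging(data, n_classifiers=11)
print(vote(ensemble, 1.2), vote(ensemble, 0.0))
```

An odd number of base classifiers is used here so that a two-class vote cannot tie.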
Diversity among classifiers is estimated by some measure, and the most discriminating classifiers are selected to make an ensemble classifier. In general, the error patterns of the classifiers are used to measure diversity [9]. Bryll proposed an approach that employs different sets of features to generate different classifiers [4]. Most studies aim at generating distinct base classifiers, but they rarely provide explicit methods for measuring the diversity of classifiers, so errors can enter the selection of classifiers. An explicit method for estimating the diversity among classifiers may help to minimize such errors and to prepare a set of diverse base classifiers.
The objective of this paper is to investigate an ensemble approach using genetic programming. Classification rules are generated by genetic programming, and the rules can be interpreted to measure diversity explicitly. A subset of rules is selected based on diversity to construct an ensemble classifier whose members are as distinct from one another as possible. The proposed method is applied to classifying gene expression profiles, an important problem in bioinformatics. Section 2 describes cancer classification using genetic programming as background. The proposed method and results are presented in Sections 3 and 4. Conclusions and future work are summarized in Section 5.
2 Cancer Classification Using Genetic Programming
Cancer classification based on gene expression profiles is one of the major research topics in both the medical field and machine learning. Recently developed DNA microarray technology provides an opportunity to take a genome-wide approach to the correct prediction of cancers. It simultaneously captures the expression levels of thousands of genes, which contain information on diseases [15].
Since an understandable classification rule is required in addition to accuracy, discovering classification rules with genetic programming was studied in previous work [16]. Even though the resulting rule is quite simple, it performs well in classifying cancers. An individual in genetic programming is represented as a tree built from the function set {+, −, ×, /} and the terminal set {f1, f2, ..., fn, constant}, where n is the number of features. The function set is designed to model the up- and down-regulation of gene expression. The grammar for the classification rule is G = {V = {EXP, OP, VAR}, T = {+, −, ×, /, f1, f2, ..., fn, constant}, P, {EXP}}, where the rule set P is as follows:
EXP → EXP OP EXP | VAR
OP → + | − | × | /
VAR → f1 | f2 | ... | fn | constant
The category of an instance is determined by evaluating the rule on it. An instance is classified into class 1 if the evaluated value is above 0 and into class 2 if the value is below 0. Conventional genetic operators for genetic programming are employed for evolution. Crossover randomly selects and swaps sub-trees between two individuals, mutation replaces a sub-tree with a new one, and permutation exchanges two sub-trees of an individual. All genetic operations are applied according to predefined probabilities.
3 Combining Classification Rules with Diversity
The proposed method consists of three processes, as shown in Figure 1: selecting features, discovering multiple rules, and selecting and combining the rules. Based on the previous work, Euclidean distance, the cosine coefficient, and the signal-to-noise ratio are employed to score the degree of association of genes with cancers. With the selected genes, genetic programming generates multiple classification rules. Diversity among these rules is estimated by the tree edit distance, and a subset of diverse classification rules is used to construct an ensemble classifier.
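The rule representation from Section 2 can be sketched in Python as a nested tuple that is evaluated recursively and classified by the sign of the result. The tuple encoding, the helper names, the protected division (returning 1.0 on division by zero), and the assignment of a value of exactly 0 to class 2 are assumptions made for illustration; the text specifies none of them.

```python
import operator

def protected_div(a, b):
    """Assumed convention (not from the paper): x / 0 evaluates to 1.0."""
    return a / b if b != 0 else 1.0

# Function set {+, -, x, /} from the grammar
FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": protected_div,
}

def evaluate(tree, features):
    """Evaluate a rule tree on one instance.

    tree is (op, left, right), a feature name such as "f3", or a numeric constant.
    features[0] holds f1, features[1] holds f2, and so on.
    """
    if isinstance(tree, tuple):
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, features), evaluate(right, features))
    if isinstance(tree, str):
        return features[int(tree[1:]) - 1]
    return float(tree)

def classify(tree, features):
    """Class 1 if the rule value is above 0, otherwise class 2.

    Sending a value of exactly 0 to class 2 is an assumption of this sketch.
    """
    return 1 if evaluate(tree, features) > 0 else 2

# Hypothetical rule: f1 - f2 * 0.5
rule = ("-", "f1", ("*", "f2", 0.5))
print(classify(rule, [2.0, 1.0]))  # rule value 1.5, so class 1
print(classify(rule, [0.2, 1.0]))  # rule value -0.3, so class 2
```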
In contrast to conventional ensemble learning methods, which simply combine the outputs of individual classifiers, the proposed method selects a subset of classification rules that maximizes diversity.
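The idea of choosing a maximally diverse subset of rules can be sketched as follows. Two loud assumptions apply: the `distance` function is a simplified structural distance that only stands in for the tree edit distance named in the text, and the exhaustive search over subsets of size k (scored by the sum of pairwise distances) is one possible selection criterion, not necessarily the paper's. Rules use a hypothetical nested-tuple encoding `(op, left, right)`, with leaves given as feature names or constants.

```python
from itertools import combinations

def size(tree):
    """Number of nodes in a rule tree."""
    if isinstance(tree, tuple):
        return 1 + size(tree[1]) + size(tree[2])
    return 1

def distance(a, b):
    """Simplified structural distance (a stand-in, NOT a full tree edit distance)."""
    a_inner, b_inner = isinstance(a, tuple), isinstance(b, tuple)
    if a_inner and b_inner:
        relabel = 0 if a[0] == b[0] else 1
        return relabel + distance(a[1], b[1]) + distance(a[2], b[2])
    if not a_inner and not b_inner:
        return 0 if a == b else 1
    # A leaf against a sub-tree: charge one unit per node of the sub-tree.
    return max(size(a), size(b))

def select_diverse(rules, k):
    """Return the k-subset with the largest sum of pairwise distances."""
    def total(subset):
        return sum(distance(a, b) for a, b in combinations(subset, 2))
    return max(combinations(rules, k), key=total)

# Hypothetical rule pool; the first two rules are near duplicates.
rules = [
    ("+", "f1", "f2"),
    ("+", "f1", "f3"),
    ("*", ("-", "f1", "f2"), "f4"),
    ("/", "f5", ("+", "f6", 2.0)),
]
chosen = select_diverse(rules, 3)
print(chosen)
```

Exhaustive search is only practical for a small pool of rules; a greedy selection would be a natural substitute for larger pools.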
Similar resources
Fault Detection of Bearings Using a Rule-based Classifier Ensemble and Genetic Algorithm
This paper proposes a reduct construction method based on discernibility matrix simplification. The method works with genetic algorithm. To identify potential problems and prevent complete failure of bearings, a new method based on rule-based classifier ensemble is presented. Genetic algorithm is used for feature reduction. The generated rules of the reducts are used to build the candidate base...
The classification of cancer based on DNA microarray data that uses diverse ensemble genetic programming
OBJECT The classification of cancer based on gene expression data is one of the most important procedures in bioinformatics. In order to obtain highly accurate results, ensemble approaches have been applied when classifying DNA microarray data. Diversity is very important in these ensemble approaches, but it is difficult to apply conventional diversity measures when there are only a few trainin...
Development of an Ensemble Multi-stage Machine for Prediction of Breast Cancer Survivability
Prediction of cancer survivability using machine learning techniques has become a popular approach in recent years. In this regard, an important issue is that preparation of some features may need conducting difficult and costly experiments while these features have less significant impacts on the final decision and can be ignored from the feature set. Therefore, developing a machine for p...
Ensemble Genetic Programming for Classifying Gene Expression Data
Ensemble is a representative technique for improving classification performance by combining a set of classifiers. It is required to maintain the diversity among base classifiers for effective ensemble. Conventional ensemble approaches construct various classifiers by estimating the similarity on the output patterns of them, and combine them with several fusion methods. Since they measure the s...
The ensemble clustering with maximize diversity using evolutionary optimization algorithms
Data clustering is one of the main steps in data mining, which is responsible for exploring hidden patterns in non-tagged data. Due to the complexity of the problem and the weakness of the basic clustering methods, most studies today are guided by clustering ensemble methods. Diversity in primary results is one of the most important factors that can affect the quality of the final results. Also...